
    Global disease monitoring and forecasting with Wikipedia

    Infectious disease is a leading threat to public health, economic stability, and other key social structures. Efforts to mitigate these impacts depend on accurate and timely monitoring to measure the risk and progress of disease. Traditional, biologically focused monitoring techniques are accurate but costly and slow; in response, new techniques based on social internet data, such as social media and search queries, are emerging. These efforts are promising, but important challenges in the areas of scientific peer review, breadth of diseases and countries, and forecasting hamper their operational usefulness. We examine a freely available, open data source for this use: access logs from the online encyclopedia Wikipedia. Using linear models, language as a proxy for location, and a systematic yet simple article selection procedure, we tested 14 location-disease combinations and demonstrated that these data feasibly support an approach that overcomes these challenges. Specifically, our proof-of-concept yields models with r² up to 0.92, forecasting value up to the 28 days tested, and several pairs of models similar enough to suggest that transferring models from one location to another without re-training is feasible. Based on these preliminary results, we close with a research agenda designed to overcome these challenges and produce a disease monitoring and forecasting system that is significantly more effective, robust, and globally comprehensive than the current state of the art.
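    To make the approach concrete, below is a minimal sketch of the kind of linear-model fit the abstract describes: regressing official case counts on weekly access counts for a few Wikipedia articles and reporting r². The article names, coefficients, and data are hypothetical placeholders, not the paper's actual inputs.

```python
# Minimal sketch of the linear-model approach described above: regress official
# case counts on weekly access counts for a handful of Wikipedia articles.
# All article names and data here are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)

# Weekly access counts for candidate articles (rows = weeks, cols = articles),
# e.g. ["Influenza", "Fever", "Oseltamivir"] in the relevant language edition.
X = rng.poisson(lam=[2000, 1500, 300], size=(52, 3)).astype(float)

# Official weekly case counts (hypothetical ground truth).
y = 0.01 * X[:, 0] + 0.02 * X[:, 1] + rng.normal(0, 5, size=52)

# Ordinary least squares with an intercept term.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# r^2 of the fit, the figure of merit reported in the abstract.
y_hat = A @ coef
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"r^2 = {r2:.3f}")
```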

    Google Health Trends performance reflecting dengue incidence for the Brazilian states

    Background: Dengue fever is a mosquito-borne infection transmitted by Aedes aegypti and found mainly in tropical and subtropical regions worldwide. Since its re-introduction in 1986, Brazil has become a hotspot for dengue and has experienced yearly epidemics. As dengue is a notifiable infectious disease, Brazil uses a passive epidemiological surveillance system to collect and report cases; however, the dengue burden is underestimated. Internet data streams may therefore complement surveillance activities by providing real-time information in the face of reporting lags.
    Methods: We analyzed 19 terms related to dengue using Google Health Trends (GHT), a free Internet data source, and compared them with weekly dengue incidence from 2011 to 2016. We correlated GHT data with dengue incidence at the national and state level for Brazil, using the adjusted R squared statistic (range 0 to 1) as the primary outcome measure. We used survey data on Internet access and variables from the official census of 2010 to identify where GHT could be useful in tracking dengue dynamics. Finally, we used a standardized volatility index on dengue incidence and developed models with different variables toward the same objective.
    Results: Of the 19 terms explored with GHT, only seven consistently tracked dengue. Of the 27 states, only 12 had an adjusted R squared higher than 0.8; these states were distributed mainly in the Northeast, Southeast, and South of Brazil. The usefulness of GHT was explained by the logarithm of the number of Internet users in the last 3 months, the total population per state, and the standardized volatility index.
    Conclusions: The potential contribution of GHT in complementing established surveillance strategies should be analyzed at geographical resolutions smaller than countries. For Brazil, GHT implementation should be analyzed on a case-by-case basis. State variables including total population, Internet usage in the last 3 months, and the standardized volatility index could serve as indicators of when GHT could complement state-level dengue surveillance in other countries.
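    The primary outcome measure above, adjusted R squared, can be computed as in the following sketch; the search-volume and incidence series below are synthetic placeholders, not real GHT or surveillance data.

```python
# Minimal sketch of the primary outcome measure used above: the adjusted R
# squared of a regression of dengue incidence on GHT search-volume series.
# The data below are synthetic placeholders, not real GHT or case data.
import numpy as np

rng = np.random.default_rng(1)
n_weeks, n_terms = 312, 7            # roughly 2011-2016 weekly data, 7 useful terms

ght = rng.random((n_weeks, n_terms))               # normalized search volumes
incidence = ght @ rng.random(n_terms) + rng.normal(0, 0.1, n_weeks)

A = np.column_stack([np.ones(n_weeks), ght])
coef, *_ = np.linalg.lstsq(A, incidence, rcond=None)
resid = incidence - A @ coef

r2 = 1 - resid.var() / incidence.var()
# Adjusted R^2 penalizes the number of predictors (n_terms) relative to n_weeks.
adj_r2 = 1 - (1 - r2) * (n_weeks - 1) / (n_weeks - n_terms - 1)
print(f"R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```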

    Epidemiological data challenges: planning for a more robust future through data standards

    Accessible epidemiological data are of great value for emergency preparedness and response, understanding disease progression through a population, and building statistical and mechanistic disease models that enable forecasting. The status quo, however, renders acquiring and using such data difficult in practice. In many cases, a primary way of obtaining epidemiological data is through the internet, but the methods by which the data are presented to the public often differ drastically among institutions. As a result, there is a strong need for better data sharing practices. This paper identifies, in detail and with examples, the three key challenges one encounters when attempting to acquire and use epidemiological data: 1) interfaces, 2) data formatting, and 3) reporting. These challenges are used to provide suggestions and guidance for improvement as these systems evolve in the future. If these data and interface recommendations were adhered to, epidemiological and public health analysis, modeling, and informatics work would be significantly streamlined, which in turn would yield better public health decision-making capabilities.
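    As an illustration of the kind of machine-readable reporting such standards would enable, a single notifiable-disease record might look like the sketch below; the field names are illustrative guesses, not a schema proposed by the paper.

```python
# Hypothetical example of a machine-readable notifiable-disease record of the
# kind the paper advocates; field names are illustrative, not a real standard.
import csv, io, json

record = {
    "location": "BR-SP",              # ISO 3166-2 region code
    "disease": "dengue",
    "period_start": "2016-01-03",     # ISO 8601 dates for the reporting week
    "period_end": "2016-01-09",
    "case_count": 1234,
    "count_type": "suspected",        # suspected vs. confirmed matters for models
    "report_date": "2016-01-15",      # enables tracking backfill/revisions
}

print(json.dumps(record, indent=2))

# The same record flattened to CSV, a lowest-common-denominator exchange format.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=record.keys())
writer.writeheader()
writer.writerow(record)
print(buf.getvalue())
```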

    Forecasting the 2013–2014 Influenza Season using Wikipedia

    Infectious diseases are one of the leading causes of morbidity and mortality around the world; thus, forecasting their impact is crucial for planning an effective response strategy. According to the Centers for Disease Control and Prevention (CDC), seasonal influenza affects between 5% and 20% of the U.S. population and causes major economic impacts resulting from hospitalization and absenteeism. Understanding influenza dynamics and forecasting its impact is fundamental for developing prevention and mitigation strategies. We combine modern data assimilation methods with Wikipedia access logs and CDC influenza-like illness (ILI) reports to create a weekly forecast for seasonal influenza. The methods are applied to the 2013–2014 influenza season but are sufficiently general to forecast any disease outbreak, given incidence or case count data. We adjust the initialization and parametrization of a disease model and show that this allows us to determine systematic model bias. In addition, we provide a way to determine where the model diverges from observation and evaluate forecast accuracy. Wikipedia article access logs are shown to be highly correlated with historical ILI records and allow for accurate prediction of ILI data several weeks before it becomes available. The results show that, prior to the peak of the flu season, our forecasting method projected the actual outcome with high probability. However, since our model does not account for re-infection or multiple strains of influenza, the tail of the epidemic is not predicted well after the peak of flu season has passed.
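    The forecasting idea can be illustrated with a minimal sketch: fit a compartmental model's parameters to observed incidence, then run the model forward. The paper's actual data-assimilation scheme is more sophisticated than this least-squares fit, and the SIR model, parameters, and data below are illustrative assumptions only.

```python
# Minimal sketch of the forecasting idea above: fit an SIR model's parameters
# to observed weekly incidence, then run the model forward. The paper's actual
# data-assimilation scheme is more sophisticated; this is illustrative only.
import numpy as np
from scipy.integrate import odeint
from scipy.optimize import minimize

def sir(y, t, beta, gamma):
    s, i, r = y
    return [-beta * s * i, beta * s * i - gamma * i, gamma * i]

t_obs = np.arange(20)                       # weeks of observed data
true = odeint(sir, [0.99, 0.01, 0.0], t_obs, args=(1.6, 0.9))[:, 1]
obs = true + np.random.default_rng(2).normal(0, 0.002, len(true))

def loss(params):
    beta, gamma = params
    i_hat = odeint(sir, [0.99, 0.01, 0.0], t_obs, args=(beta, gamma))[:, 1]
    return np.sum((i_hat - obs) ** 2)

fit = minimize(loss, x0=[1.0, 0.5], method="Nelder-Mead")
beta, gamma = fit.x

# Forecast several weeks past the observation window with the fitted model.
t_fc = np.arange(30)
forecast = odeint(sir, [0.99, 0.01, 0.0], t_fc, args=(beta, gamma))[:, 1]
print(f"beta={beta:.2f}, gamma={gamma:.2f}, peak week={forecast.argmax()}")
```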

    The Biosurveillance Analytics Resource Directory (BARD): Facilitating the Use of Epidemiological Models for Infectious Disease Surveillance

    Epidemiological modeling of infectious disease is important for disease management, and its routine use needs to be facilitated through better description of models in an operational context. A standardized model characterization process that allows users to select and manually compare available models and their results is currently lacking. A key need is a universal framework that facilitates model description and understanding of each model's features. Los Alamos National Laboratory (LANL) has developed a comprehensive framework that can be used to characterize an infectious disease model in an operational context. The framework was developed through consensus among a panel of subject matter experts. In this paper, we describe the framework, its application to model characterization, and the development of the Biosurveillance Analytics Resource Directory (BARD; http://brd.bsvgateway.org/brd/), which facilitates the rapid selection of operational models for specific infectious/communicable diseases. We offer this framework and the associated database to stakeholders in the infectious disease modeling field as a tool for standardizing model description and facilitating the use of epidemiological models.
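    A standardized characterization record of the kind such a framework enables might look like the sketch below; these fields are illustrative guesses, not the actual BARD schema.

```python
# Hypothetical model-characterization record illustrating the kind of
# standardized description the framework enables. These fields are
# illustrative guesses, not the actual BARD schema.
model_entry = {
    "name": "Example intra-herd FMD model",
    "disease": "foot-and-mouth disease",
    "model_type": "stochastic compartmental (SEIR)",
    "spatial_scale": "single herd",
    "population": "cattle",
    "inputs": ["contact rate", "latent period", "infectious period"],
    "outputs": ["daily incidence", "outbreak duration"],
    "operational_use": "outbreak response planning",
}

# Uniform records make models searchable and comparable, e.g. by disease:
def matches(entry, disease):
    return entry["disease"] == disease

print(matches(model_entry, "foot-and-mouth disease"))
```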

    Defending Our Public Biological Databases as a Global Critical Infrastructure

    Progress in modern biology is being driven, in part, by the large amounts of freely available data in public resources such as the International Nucleotide Sequence Database Collaboration (INSDC), the world's primary database of biological sequence (and related) information. INSDC and similar databases have dramatically increased the pace of fundamental biological discovery and enabled a host of innovative therapeutic, diagnostic, and forensic applications. However, as high-value, openly shared resources with a high degree of assumed trust, these repositories share compelling similarities to the early days of the Internet. Consequently, as public biological databases continue to increase in size and importance, we expect that they will face the same threats as undefended cyberspace. There is a unique opportunity, before a significant breach and loss of trust occurs, to ensure they evolve with quality and security as a design philosophy rather than costly “retrofitted” mitigations. This Perspective surveys some potential quality assurance and security weaknesses in existing open genomic and proteomic repositories, describes methods to mitigate the likelihood of both intentional and unintentional errors, and offers recommendations for risk mitigation based on lessons learned from cybersecurity
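    One mitigation in the spirit of those discussed, borrowed from standard cybersecurity practice, is integrity checking of deposited records with cryptographic digests; the sketch below is a generic illustration, with hypothetical record content, rather than a mechanism the paper prescribes.

```python
# Minimal sketch of one mitigation in the spirit of the above: detecting
# unintentional (or intentional) modification of database records via
# cryptographic digests. Record content and storage are hypothetical.
import hashlib

def digest(record: str) -> str:
    """Return a SHA-256 digest of a sequence record's canonical form."""
    return hashlib.sha256(record.encode("utf-8")).hexdigest()

original = ">seq1 example entry\nACGTACGTACGT"
stored_digest = digest(original)          # computed when the record is deposited

# Later, on retrieval, recompute and compare to detect tampering or corruption.
retrieved = ">seq1 example entry\nACGTACGAACGT"   # one base silently changed
print("intact" if digest(retrieved) == stored_digest else "record altered")
```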

    Results from the Centers for Disease Control and Prevention's Predict the 2013–2014 Influenza Season Challenge

    Background: Early insights into the timing of the start, peak, and intensity of the influenza season could be useful in planning influenza prevention and control activities. To encourage development and innovation in influenza forecasting, the Centers for Disease Control and Prevention (CDC) organized a challenge to predict the 2013–14 United States influenza season.
    Methods: Challenge contestants were asked to forecast the start, peak, and intensity of the 2013–2014 influenza season at the national level and at any or all Health and Human Services (HHS) region levels. The challenge ran from December 1, 2013 to March 27, 2014; contestants were required to submit 9 biweekly forecasts at the national level to be eligible. The winner was selected based on expert evaluation of the methodology used to make the prediction and on the accuracy of the prediction as judged against the U.S. Outpatient Influenza-like Illness Surveillance Network (ILINet).
    Results: Nine teams submitted 13 forecasts covering all required milestones. The first forecast was due on December 2, 2013; 3 of the 13 forecasts received correctly predicted the start of the influenza season within one week, 1 of 13 predicted the peak within 1 week, 3 of 13 predicted the peak ILINet percentage within 1%, and 4 of 13 predicted the season duration within 1 week. For the prediction due on December 19, 2013, the number of forecasts that correctly predicted the peak week increased to 2 of 13, the peak percentage to 6 of 13, and the duration of the season to 6 of 13. As the season progressed, the forecasts became more stable and closer to the season milestones.
    Conclusion: Forecasting has become technically feasible, but further efforts are needed to improve forecast accuracy so that policy makers can reliably use these predictions. CDC and the challenge contestants plan to build upon the methods developed during this contest to improve the accuracy of influenza forecasts.
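    The accuracy criteria above (peak week within 1 week, peak percentage within 1%) amount to simple tolerance checks against the observed ILINet values, as in the sketch below; all numbers and team labels are hypothetical, not challenge results.

```python
# Minimal sketch of the kind of scoring used in the challenge: checking
# whether a forecast's peak week and peak ILI percentage fall within the
# stated tolerances of the observed ILINet values. Numbers are hypothetical.
observed_peak_week, observed_peak_pct = 52, 4.6      # hypothetical ILINet truth

forecasts = [
    {"team": "A", "peak_week": 52, "peak_pct": 4.9},
    {"team": "B", "peak_week": 50, "peak_pct": 4.5},
    {"team": "C", "peak_week": 53, "peak_pct": 3.2},
]

for f in forecasts:
    week_ok = abs(f["peak_week"] - observed_peak_week) <= 1   # within 1 week
    pct_ok = abs(f["peak_pct"] - observed_peak_pct) <= 1.0    # within 1 percent
    print(f["team"], "peak week:", week_ok, "| peak %:", pct_ok)
```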

    Parametric Uncertainty in Intra-Herd Foot-and-Mouth Disease Epidemiological Models

    Epidemiological models that simulate the spread of Foot-and-Mouth Disease within a herd are the foundation of decision support tools used by governments to help advise and inform strategy to combat outbreaks. Contact transmission data used to parameterize these models, contrary to common assumption, contain a significant amount of variability and uncertainty. This finding suggests that the resulting model output might not accurately simulate the spread of an outbreak; if so, the impact of this uncertainty on the decision support tools used by governments might be significant.
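    One standard way to examine such parametric uncertainty is Monte Carlo propagation: sample transmission parameters from distributions reflecting the variability in the data and observe the spread in model outcomes. The sketch below illustrates the idea with a generic discrete-time intra-herd model; the model form and distributions are assumptions, not the paper's.

```python
# Minimal sketch of propagating parametric uncertainty through an intra-herd
# model: sample transmission parameters from distributions reflecting the
# variability in contact-transmission data, and observe the spread in outcomes.
# The model form and distributions here are illustrative, not the paper's.
import numpy as np

rng = np.random.default_rng(3)
n_draws, herd, days = 1000, 200, 60

peaks = []
for _ in range(n_draws):
    beta = rng.lognormal(mean=np.log(0.5), sigma=0.4)   # uncertain transmission rate
    gamma = 1 / rng.uniform(3, 7)                       # uncertain infectious period
    s, i = herd - 1.0, 1.0
    history = []
    for _ in range(days):
        new_inf = beta * s * i / herd
        s, i = s - new_inf, i + new_inf - gamma * i
        history.append(i)
    peaks.append(max(history))

print(f"peak infected: median={np.median(peaks):.0f}, "
      f"90% interval=({np.percentile(peaks, 5):.0f}, {np.percentile(peaks, 95):.0f})")
```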

    Epi Archive: Automated Synthesis of Global Notifiable Disease Data

    Objective: Automatically collect and synthesize global notifiable disease data and make them available to humans and computers. Provide the data on the web and within the Biosurveillance Ecosystem (BSVE) as a novel data stream. These data have many applications, including improving the prediction and early warning of disease events.
    Introduction: Government reporting of notifiable disease data is common and widespread, though most countries do not report in a machine-readable format. This is despite the WHO International Health Regulations stating that "[e]ach State Party shall notify WHO, by the most efficient means of communication available" [1]. Data are often in the form of a file that contains text, tables, and graphs summarizing weekly or monthly disease counts. This presents a problem when information is needed for more data-intensive approaches to epidemiology, biosurveillance, and public health. While most nations likely store incident data in a machine-readable format, governments can be hesitant to share data openly for a variety of reasons, including technical, political, economic, and motivational barriers [2]. A survey conducted by LANL of notifiable disease data reporting in over fifty countries identified only a few websites that report data in a machine-readable format. The majority (>70%) produce reports as PDF files on a regular basis. The bulk of the PDF reports present data in a structured tabular format, while some report in natural language or graphical charts. The structure and format of PDF reports change often; this adds to the complexity of identifying and parsing the desired data. Not all websites publish in English, and it is common to find typos and clerical errors. LANL has developed a tool, Epi Archive, to collect global notifiable disease data automatically and continuously and to make them uniform and readily accessible.
    Methods: A survey of national notifiable disease reporting systems is periodically conducted, noting how the data are reported and in what formats. We determined the minimal metadata required to contextualize incident counts properly, as well as optional metadata that is commonly found. The software that regularly ingests notifiable disease data and makes it available involves three to four main steps: scraping, detecting (when needed), parsing, and persisting.
    Scraping: We examine website design and determine the reporting mechanisms for each country/website, as well as what varies across those mechanisms. We then design and write code to automate the downloading of data for each country. We store all artifacts presented as files (PDF, XLSX, etc.) in their original form, along with appropriate metadata for parsing and data provenance.
    Detecting: This step is required when parsing structured, non-machine-readable data, such as tabular data in PDF files. We combine the Nurminen methodology of PDF table detection with in-house heuristics to find the desired data within PDF reports [3].
    Parsing: We determine what to extract from each dataset and parse these data into uniform data structures, correctly accommodating variations in metadata (e.g., time interval definitions) and the various human languages.
    Persisting: We store the data in the Epi Archive database and make them available on the internet and through the BSVE. The data are persisted into a structured and normalized SQL database.
    Results: Epi Archive currently contains national and/or subnational notifiable disease data from thirty-nine nations. When users access the Epi Archive site, they are able to peruse, chart, and download data by country, subregion, disease, and time interval. A cached version of the original artifacts (e.g., PDF files), a link to the source, and additional metadata are also available through the user interface. Finally, to ensure machine-readability, the data in Epi Archive can be reached through a REST API: http://epiarchive.bsvgateway.org/
    Conclusions: LANL, as part of a currently funded DTRA effort, is automatically and continually collecting global notifiable disease data. While thirty-nine nations are in production, more are being brought online in the near future. These data are already being utilized and have many applications, including improving the prediction and early warning of disease events.
    References:
    [1] WHO International Health Regulations, 3rd edition. http://apps.who.int/iris/bitstream/10665/246107/1/9789241580496-eng.pdf
    [2] van Panhuis WG, Paul P, Emerson C, et al. A systematic review of barriers to data sharing in public health. BMC Public Health. 2014;14:1144. doi:10.1186/1471-2458-14-1144
    [3] Nurminen A. Algorithmic extraction of data in tables in PDF documents. 2013.
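    The scrape/detect/parse/persist pipeline described in the Methods above can be illustrated with a minimal end-to-end sketch; the report format, table layout, and schema below are hypothetical placeholders, since the real Epi Archive handles many formats and languages.

```python
# Minimal sketch of the scrape -> detect -> parse -> persist pipeline described
# above. URLs, table layout, and schema are hypothetical placeholders; the real
# Epi Archive handles many report formats and languages.
import sqlite3
import urllib.request

def scrape(url: str) -> bytes:
    """Download a report artifact, keeping the original bytes for provenance."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def detect_and_parse(text: str) -> list[tuple[str, str, int]]:
    """Pull (disease, week, count) rows out of a simple delimited report.
    Real reports are PDFs and need table detection, e.g. Nurminen's method."""
    rows = []
    for line in text.strip().splitlines():
        disease, week, count = line.split(",")
        rows.append((disease.strip(), week.strip(), int(count)))
    return rows

def persist(rows, db_path=":memory:"):
    """Store normalized rows in a structured SQL database."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS counts (disease TEXT, week TEXT, n INTEGER)")
    con.executemany("INSERT INTO counts VALUES (?, ?, ?)", rows)
    con.commit()
    return con

# A toy report stands in for a downloaded artifact.
report = "dengue, 2016-W01, 1234\nmeasles, 2016-W01, 7"
con = persist(detect_and_parse(report))
print(con.execute("SELECT * FROM counts").fetchall())
```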